-
Notifications
You must be signed in to change notification settings - Fork 300
NetCDF save with dask support #2411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@pp-mo I assume that I now just re-target this PR against |
| else: | ||
| array = ma.masked_array(array, mask=mask, | ||
| fill_value=fill_value) | ||
| return array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pp-mo I've combined my array_nans_to_filled and array_nans_to_masked and opted to use the filled kwarg, rather than have separate but very similar functions.
Interested on your take with my raising an exception for the filled = True case but fill_value = None ... seems the right thing to do.
Also, that this routine will perform dtype casting ... again, seems to me like the best place to do it, but whether that capability wrapped up within array_nans_to_masked is appropriate for a higher level function that calls it, such as your as_concrete_data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, mine is very very similar here.
Seems like a case of "furious agreement"
| self._numpy_array = array_nans_to_masked(data, | ||
| self.fill_value, | ||
| self.dtype) | ||
| self.dtype = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pp-mo I'm expecting you to have stomped over this with your PR, but hopefully they align in logic ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully they align in logic
Dead right, mine is spookily similar !
pp-mo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not muich to argue over here, so check it out
lib/iris/_lazy_data.py
Outdated
| # Finally, mask or fill the data, as required. | ||
| if np.any(mask): | ||
| if filled: | ||
| if fill_value is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't actually work...
if filled:
if fill_value is None:
...
I think it needs to be
if filled is not False:
if filled is None:
lib/iris/_lazy_data.py
Outdated
| # First, calculate the mask. | ||
| mask = np.isnan(array) | ||
| # Now, cast the dtype, if required. | ||
| if dtype is not None and dtype != array.dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for efficiency's sake (aargh!!) this is better done after filling -- we don't want to do a masked 'astype' when it isn't needed.
lib/iris/_lazy_data.py
Outdated
| if np.any(mask): | ||
| if filled: | ||
| if fill_value is None: | ||
| emsg = 'Invalid fill value, got {!r}.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As-is there;'s not much point in
if fill_value is None:
'Invalid fill value, got {!r}.'
It might as well say the exact truth.
E.G. "Dask result contains missing data, but no fill value was provided."
| self._numpy_array = array_nans_to_masked(data, | ||
| self.fill_value, | ||
| self.dtype) | ||
| self.dtype = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully they align in logic
Dead right, mine is spookily similar !
| else: | ||
| array = ma.masked_array(array, mask=mask, | ||
| fill_value=fill_value) | ||
| return array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, mine is very very similar here.
Seems like a case of "furious agreement"
lib/iris/_lazy_data.py
Outdated
| return array | ||
|
|
||
|
|
||
| def array_nans_to_masked(array, fill_value=None, dtype=None, filled=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name probably needs fixing ?
lib/iris/_lazy_data.py
Outdated
|
|
||
| def array_nans_to_masked(array, fill_value=None, dtype=None, filled=False): | ||
| """ | ||
| Convert an array into a masked array, by masking any NaN points. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs fix: it does not always return a masked array.
(nor should it, IMHO)
|
I think there are three useful usecases
(+ in future we might consider a "don't even check for NaNs" option for performance-only-reasons) |
|
Current + possible future call signatures for the above usecases
current calls
future ideas ?
The above still allows you to specify the fill_value of a masked array it creates.
|
Agreed use case, interesting part is whether we care to set the If a user plucks a concrete masked data payload from the cube, for whatever reason, then they can set the I guess I'm quickly coming to the conclusion that I can't see a use case for why we would want to set the
Yup ... and fill them with the
@pp-mo I'm struggling to see when this might be required ... are you thinking of
Yup, but not now, right? That's an optimisation that I'd love for us to be in that "happy place" to make, but now just right now. |
Yes, that's exactly what I'm thinking of. |
237e124 to
6d2f381
Compare
6d2f381 to
570f2ca
Compare
| # masked result, but it ensures we use a "filled" version of the | ||
| # input in this case. | ||
| if cube.fill_value is not None: | ||
| source_data.fill_value = cube.fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way trajectory handles masks is pretty ropey ... this change appears appropriate.
The preceding inline code comment says it all really ...
lib/iris/_lazy_data.py
Outdated
| # Check the fill value is appropriate for the | ||
| # target result dtype. | ||
| try: | ||
| [fill_value] = np.asarray([nans], dtype=result_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the dtype of the array not the result_dtype ...
lib/iris/_lazy_data.py
Outdated
| [fill_value] = np.asarray([nans], dtype=result_dtype) | ||
| except OverflowError: | ||
| emsg = 'Fill value of {!r} invalid for result {!r}.' | ||
| raise ValueError(emsg.format(nans, result_dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this should use array.dtype not result_dtype
|
Downloads to travis seem to be getting throttled at the moment ... which is painful 😱 |
lib/iris/_lazy_data.py
Outdated
| * nans: | ||
| If `nans` is None, then raise an exception if the `array` contains | ||
| any NaN values (default). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nans is None is never used so why did you include?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the default in the function formal parameter list.
Also, it's a very useful value to default to because it's gonna raise an exception if it detects that there are NaNs in your data. We're forcing people to care about this. So for example, this is good behaviour to adopt as it will force the issue in cases when streamed saving is being attempted with NaN data, but the fill_value isn't set i.e. None. See here in particular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nans is None is never used
But it is here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it is here ?
I meant when convert_nans_array is used. but I forgot that the fill_value of a cube could be set to None as @bjlittle's comment points out
lib/iris/_lazy_data.py
Outdated
| * nans: | ||
| If `nans` is None, then raise an exception if the `array` contains | ||
| any NaN values (default). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think nans is descriptive of the purpose of this keyword argument. Perhaps nans_replacement would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My take on it is, what are my NaN going to be replace with
nans=fill_valueis replace NaNs with thefill_valuenans=ma.maskedis replace my NaNs with a masknan=Noneis I don't care, there shouldn't be any NaNs in my data, barf if there is
I'm open to a group take on this, as I've been staring at it way too long ... but I'm kinda wedded to the short and sweet nans name ... thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that's two votes for change ... that's enough for me ... nans_replacement it is!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nans_substitute is one character shorter if you prefer that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. First impressions last and nans_replacement is perfect.
| raise ValueError(emsg) | ||
| elif nans is ma.masked: | ||
| # Mask the array with the default fill_value. | ||
| array = ma.masked_array(array, mask=mask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this uses the default fill_value irrespective of what the fill value on the cube is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. We no longer care. The fill_value that we do care about lives on the cube.
| self.assertEqual(result.fill_value, expected.fill_value) | ||
|
|
||
| def test_nans_filled(self): | ||
| fill_value = 666.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👿
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Muh-ha-ha-har!
I'm looking at it now, but I don't claim to be in charge. |
|
Re:
I'm think I'm happy with the routine function, but I will support better naming of arguments ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with the changes in existing code.
There are a few awkward corners in the behaviour of the core routinem + its testing, which I think at least deserve documenting.
| If `nans_replacement` is None, then raise an exception if the `array` | ||
| contains any NaN values (default behaviour). | ||
| If `nans_replacement` is `numpy.ma.masked`, then convert the `array` | ||
| to a :class:`~numpy.ma.core.MaskedArray`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: something missing here.
We should explain that in 'masked array' mode the input array is copied, but in "fill or no-nans modes", the result is the input array, which may be modified in-place.
may be modified in-place
P.S. except when we do a change of dtype.
A bit slippery that, isn't it 😬 ??
Perhaps should just say "in some cases, the input is modified in-place" + leave it at that ?
| self.assertIs(result, self.array) | ||
| expected = np.array([[1.0, fill_value], | ||
| [3.0, 4.0]]) | ||
| self.assertArrayEqual(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also check that result is the input (no copy, fill-in-place)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pp-mo Well, that's already done on line 80 ... 👓
| expected = np.array([[1, fill_value], | ||
| [3, 4]], | ||
| dtype=dtype) | ||
| self.assertArrayEqual(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, result is not the original (not in-place with new dtype).
P.S. did you know you can assign array.dtype ? It does like a C pointer cast 😱 .
| self.assertIsInstance(result, ma.MaskedArray) | ||
| self.assertIs(result, array) | ||
|
|
||
| def test_pass_thru_masked_array_integer(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Un -masked ints are pass-through anyway.
Should be checking that.
It's in the docstring!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pp-mo Yup, good spot. Done!
| fill value. | ||
| * result_dtype: | ||
| Cast the resultant array to this target :class:`~numpy.dtype`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not happen with pass-through cases (anything already masked, or integer types).
We should explain that here, somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pp-mo This is actually mentioned in the doc-string note. I think that is sufficient.
| result = array.data.copy() | ||
| result[mask] = np.nan | ||
| else: | ||
| result = array.data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer This addresses the following numpy warning:
MaskedArrayFutureWarning: setting an item on a masked array which has a shared mask will not copy the mask and also change the original mask array in the future.
Check the NumPy 1.11 release notes for more information.
Which was generated by array[mask] = np.nan whenever the data was masked, but the dtype was not integral.
|
closes #2347 |
| <?xml version="1.0" ?> | ||
| <cubes xmlns="urn:x-iris:cubeml-0.2"> | ||
| <cube standard_name="eastward_wind" units="m s-1" var_name="wind1"> | ||
| <cube core-dtype="int32" dtype="int32" standard_name="eastward_wind" units="m s-1" var_name="wind1"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point we need to take out these core-dtype from the cml.
Perhaps at the end of this sprint (considering we want to target fill_value/dtype/get-tests-working this sprint)
|
All comments have now been addressed so I'm gonna merge this in |
|
@lbdreyer Awesome, thanks! |
* NetCDF save with dask support * Refactor and use array_nans_to_masked * Add array_nans_to_masked test coverage. * Refactor. * Working tests. * Update comment detail for convert usage. * Use array.dtype in convert_nans_array * Review actions. * Review actions.
Support for streamed netCDF saving using
dask.Introduces the convenience function
iris._lazy_data.convert_nans_arrayto deal with converting a NaN array to either a masked array of a filled ndarray.